Modeling of Speech Parameter Sequence Considering Global Variance for HMM-Based Speech Synthesis
Speech technologies such as speech recognition and speech synthesis have many potential applications since speech is the main way in which most people communicate. Various linguistic sounds are produced by controlling the configuration of oral cavities to convey a message in speech communication. The produced speech sounds temporally vary and ar
Collapsed speech segment detection and suppression for WaveNet vocoder
In this paper, we propose a technique to alleviate the quality degradation
caused by collapsed speech segments sometimes generated by the WaveNet vocoder.
The effectiveness of the WaveNet vocoder for generating natural speech from
acoustic features has been proved in recent works. However, it sometimes
generates very noisy speech with collapsed speech segments when only a limited
amount of training data is available or significant acoustic mismatches exist
between the training and testing data. Such a limitation on the corpus and
limited ability of the model can easily occur in some speech generation
applications, such as voice conversion and speech enhancement. To address this
problem, we propose a technique to automatically detect collapsed speech
segments. Moreover, to refine the detected segments, we also propose a waveform
generation technique for WaveNet using a linear predictive coding constraint.
Verification and subjective tests are conducted to investigate the
effectiveness of the proposed techniques. The verification results indicate
that the detection technique can detect most collapsed segments. The subjective
evaluations of voice conversion demonstrate that the generation technique
significantly improves the speech quality while maintaining the same speaker
similarity.
Comment: 5 pages, 6 figures. Proc. Interspeech, 201
Analysis of Noisy-target Training for DNN-based speech enhancement
Deep neural network (DNN)-based speech enhancement usually uses clean
speech as the training target. However, it is hard to collect large amounts of
clean speech because recording it is very costly. In other words, the
performance of current speech enhancement has been limited by the amount of
training data. To relax this limitation, Noisy-target Training (NyTT), which
utilizes noisy speech as a training target, has been proposed. Although it has
been experimentally shown that NyTT can train a DNN without clean speech, a
detailed analysis has not been conducted and its behavior has not been
understood well. In this paper, we conduct various analyses to deepen our
understanding of NyTT. In addition, based on the property of NyTT, we propose a
refined method that is comparable to the method using clean speech.
Furthermore, we show that we can improve the performance by using a huge amount
of noisy speech with clean speech.
Comment: Submitted to ICASSP 202
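The abstract does not spell out how NyTT training pairs are built. As a minimal sketch, assuming the input is formed by mixing additional noise into already-noisy speech while the noisy speech itself serves as the target, pair construction might look like the following. The function names `mix_at_snr` and `make_nytt_pair` and the SNR-based mixing are illustrative assumptions, not the authors' implementation.

```python
import math


def mix_at_snr(signal, noise, snr_db):
    """Scale `noise` so that signal power / noise power equals snr_db, then add it."""
    if len(signal) != len(noise):
        raise ValueError("signal and noise must have the same length")
    p_signal = sum(x * x for x in signal) / len(signal)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_signal == 0 or p_noise == 0:
        raise ValueError("signal and noise must both have non-zero power")
    # Gain that brings the scaled noise to the requested power ratio.
    gain = math.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return [x + gain * n for x, n in zip(signal, noise)]


def make_nytt_pair(noisy_speech, extra_noise, snr_db):
    """Hypothetical helper: build one (input, target) pair without clean speech.

    Assumption: the target is the noisy speech as recorded, and the input is
    the same recording with extra noise mixed in.
    """
    model_input = mix_at_snr(noisy_speech, extra_noise, snr_db)
    target = list(noisy_speech)
    return model_input, target
```

A network trained on such pairs never sees clean speech; a conventional setup would differ only in using a clean recording as `target`.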
Evaluating Methods for Ground-Truth-Free Foreign Accent Conversion
Foreign accent conversion (FAC) is a special application of voice conversion
(VC) which aims to convert the accented speech of a non-native speaker to a
native-sounding speech with the same speaker identity. FAC is difficult since
the native speech from the desired non-native speaker to be used as the
training target is impossible to collect. In this work, we evaluate three
recently proposed methods for ground-truth-free FAC, all of which aim to
harness the power of sequence-to-sequence (seq2seq) and non-parallel VC models
to properly convert the accent and control the speaker identity. Our
experimental evaluation results show that no single method was significantly
better than the others in all evaluation axes, which is in contrast to
conclusions drawn in previous studies. We also explain the effectiveness of
these methods with the training input and output of the seq2seq model and
examine the design choice of the non-parallel VC model, and show that
intelligibility measures such as word error rates do not correlate well with
subjective accentedness. Finally, our implementation is open-sourced to promote
reproducible research and help future researchers improve upon the compared
systems.
Comment: Accepted to the 2023 Asia Pacific Signal and Information Processing
Association Annual Summit and Conference (APSIPA ASC). Demo page:
https://unilight.github.io/Publication-Demos/publications/fac-evaluate. Code:
https://github.com/unilight/seq2seq-v
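
For readers unfamiliar with the intelligibility measure mentioned in the abstract, word error rate is conventionally the word-level edit distance between a reference transcript and a recognizer's hypothesis, divided by the number of reference words. The sketch below is a generic illustration with a hypothetical function name; it assumes whitespace tokenization and is not taken from the authors' code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words consumed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.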
A modulation property of time-frequency derivatives of filtered phase and its application to aperiodicity and fo estimation
We introduce a simple and linear SNR (strictly speaking, periodic-to-random
power ratio) estimator (0 dB to 80 dB without additional
calibration/linearization) for providing reliable descriptions of aperiodicity
in speech corpora. The main idea of this method is to estimate the background
random noise level without directly extracting the background noise. The
proposed method is applicable to a wide variety of time windowing functions
with very low sidelobe levels. The estimate combines the frequency derivative
and the time-frequency derivative of the mapping from filter center frequency
to the output instantaneous frequency. This procedure can replace the
periodicity detection and aperiodicity estimation subsystems of the recently
introduced open-source YANG vocoder. Source code of the MATLAB
implementation of this method will also be open-sourced.
Comment: 8 pages, 9 figures. Submitted and accepted in Interspeech201